Introduction to Open Data Science - Course Project

1. About the project

I’m excited about this course since I already learned a bunch of new things during the installation/setup phase, even though I use the key tools of the course, R (RStudio) and GitHub, on a daily basis. Since I’m very much self-taught, it will be intriguing to learn new things about these tools and perhaps how to use them in a more sensible manner. Learning more about the interplay between RStudio and GitHub will be especially useful for me, and I’m also excited to dive into the other contents of the course.

To be honest, I haven’t gone over R for Health Data Science in detail yet, just browsed it, but I think it will be a great source of information for the later parts of the course. I’m quite familiar with R basics and plotting in R (often with ggplot2), but I rarely use other tidyverse packages like dplyr anymore: as a largely self-taught R user, I usually find other ways to execute things that I find more intuitive than the dplyr commands. But maybe I’ll be converted to a fan after the course, who knows!

My GitHub page can be found at my IODS-project GitHub page.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Nov 21 15:53:39 2023"

2. Regression analysis

date()
## [1] "Tue Nov 21 15:53:39 2023"

Loading necessary R packages

library(dplyr)
library(readr)
library(ggplot2)
library(GGally)

2.1 Learning 2014 dataset

# reading in the learning 2014 dataset and saving it as a data frame in a variable called "data"
data <- as.data.frame(read_csv("/Users/sannimoi/Documents/courses/IODS/IODS-project/data/learning2014.csv", show_col_types = FALSE))

# checking the dimensions and structure of the data
dim(data)
## [1] 166   7
# and structure of the data
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : num  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num  25 12 24 10 22 21 21 31 24 26 ...

The original learning 2014 dataset was collected by Kimmo Vehkalahti in 2014-2015. It includes answers from 183 students to a variety of questions related to their learning and studying habits, as well as to the teaching, on a course named Introduction to Social Statistics. The original dataset contained 60 variables and 183 observations (students); the current dataset was processed from it and, based on the dimensions and structure visible above, includes 166 observations and seven variables: gender, age, attitude, deep, stra, surf, and points. While age and gender are easy to understand, the rest of the variables may require a bit of clarification:

  • attitude - global attitude towards statistics

  • deep - average value from 12 questions related to deep learning

  • stra - average value from 8 questions related to strategic learning

  • surf - average value from 12 questions related to surface learning

  • points - exam points

The deep, strategic and surface learning questions were all measured on the Likert scale (1-5: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree).

2.2 Graphical overview of the data

# printing a graphical overview of the data, coloring the distributions by gender
ggpairs(data, mapping = aes(col=gender), lower = list(combo = wrap("facethist", bins = 20)))

In the graphical summary above, we can appreciate the distributions of each variable in the data and the relationships between the different variables. The data points, distributions and correlations are coloured by gender (red for female and blue for male students). The plots including gender look a bit different from the others, since it is the only variable in the data that is not numeric.

The number of female students who participated in the course is almost double that of the male students, which needs to be kept in mind when interpreting the results. Based on the distributions, the female students seem to be a bit younger on average, they score lower on the global attitude towards statistics, and they record slightly higher values for the strategic and surface learning questions than their male counterparts. In contrast, the deep learning and exam point distributions look quite similar between the genders.

Regarding the correlations between the different variables, the results differ quite a bit depending on gender. For instance, the values from the surface learning questions seem to correlate negatively with both attitude and deep learning, but these correlations appear to be specific to the male participants. Age has quite minimal correlations with the other variables overall, although the age of male participants seems to be negatively correlated with exam points; no concrete conclusions can be drawn from this, however, due to the evidently low number of older male students. Most notably, attitude has the highest positive correlation with exam points regardless of gender.
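The gender-specific correlations discussed above come from the ggpairs panels, but the same idea can be checked numerically. A minimal base-R sketch (using the built-in iris data grouped by Species as a stand-in, since the course data is not bundled here):

```r
# split a data frame by a grouping factor and compute a correlation
# matrix within each group (iris used as a stand-in dataset)
cor_by_group <- lapply(split(iris[, 1:4], iris$Species), cor)

# e.g., sepal length vs petal length correlation within the setosa group
cor_by_group$setosa["Sepal.Length", "Petal.Length"]
```

With the course data, the grouping factor would be gender and the numeric columns the learning variables.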

2.3 Fitting regression models and interpreting the results

To explore further which variables seem to associate with the exam points variable, a regression model is fitted to the data with three explanatory variables. I chose attitude, stra and surf as the three explanatory variables since they seem to have the highest absolute correlation values with the exam points based on the summary plot above.

# creating a regression model with multiple explanatory variables: for exam points attitude, stra and surf 
my_model2 <- lm(points ~ attitude + stra + surf, data = data)
# printing out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model essentially describes how well the combination of the attitude, stra and surf variables predicts the outcome variable, exam points. The printed summary starts with a description of the residuals, i.e., the differences between the actual observed values and the values estimated by the model: the minimum and maximum differences, the 1st and 3rd quartile values, and the median difference. The median is about 0.5 here, quite close to zero, which is good and suggests that the residuals are fairly symmetrical.

The coefficients table, in turn, describes the coefficients of the model, which are the unknown constants of the linear model: the intercept and the slopes. Based on these estimated terms and the explanatory variables, we can predict the outcome variable, i.e., exam points. The p-values are visible in the last column; each describes the result of a significance test of the null hypothesis that the corresponding coefficient (intercept or slope) is 0. The first row of the table describes the intercept, i.e., the estimated exam points when all explanatory variables are zero. For the intercept, we get a p-value of 0.00322, which is significant. The table also has standard error and t-value columns, and these are related, since the t-value is the coefficient divided by its standard error. The standard error describes the level of uncertainty in the coefficient, so we would like it to be as small as possible relative to the coefficient; correspondingly, for a confident estimate we would like the t-value to be as large as possible, since a large t-value indicates that the standard error is small in comparison to the coefficient. Here, the standard error of the intercept is relatively high, about 3.7 compared to the coefficient of about 11, and therefore the t-value remains smaller than 3.
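As a quick arithmetic check of the t-value definition (estimate divided by standard error), plugging in the intercept row of the coefficients table above:

```r
# intercept row of the coefficients table above
estimate  <- 11.01711
std_error <- 3.68375

# t value = estimate / standard error
t_value <- estimate / std_error
round(t_value, 3)  # 2.991, matching the t value in the summary
```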

Since we have multiple explanatory variables, there is also a row for each of them, describing the estimated slope along with its standard error, t-value and p-value. We can see that surf (surface learning) is the only variable with a negative estimated relationship with exam points (negative slope), and it has the highest standard error as well as the smallest t-value; accordingly, its p-value is nearly 0.5, meaning that a relationship this strong would arise by chance almost 50% of the time under the null hypothesis. Therefore, I chose to fit the model again without the surf variable.

# creating a regression model with multiple explanatory variables: for exam points attitude and stra
my_model2 <- lm(points ~ attitude + stra, data = data)

summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Based on the output, we can see that the p-value for the intercept is now 0.00025, more significant than in the previous model, yet the p-value for the stra explanatory variable remains non-significant at 0.089, so I will fit the model once more without it.

# creating a regression model with one explanatory variable, attitude, for exam points
my_model2 <- lm(points ~ attitude, data = data)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
qplot(attitude, points, data = data) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Now we can observe from the summary that the model has further improved: the p-value for the intercept is much more significant at 1.95e-09, and the p-value for the slope is almost as significant at 4.12e-09. Moreover, the standard errors are smaller and the t-values have also improved, exceeding 6 for both the intercept and the slope.

I also plotted a scatterplot above to further describe the relationship between the two variables, with the blue line showing the estimated regression model. We can see that a higher score on the attitude scale, i.e., the explanatory variable, is associated with higher exam points, i.e., the outcome variable. This positive correlation can also be seen in the summary, since the estimated slope of the model is greater than 0.

Moreover, from the summary we can check the residual standard error, which describes the quality of the fit: it is essentially the typical size of the error term that every linear model is expected to contain. For the present model it is 5.32, corresponding to roughly a 46% percentage error. The degrees of freedom equal the number of observations minus the number of estimated parameters, here 166 − 2 = 164. The F-statistic, in turn, is a good indicator of whether there is a relationship between the predictor and response variables, and its p-value is derived from a test where the null hypothesis is that there is no relationship. The F-statistic has steadily grown and the p-value shrunk as the surf and stra explanatory variables were removed, so the model has improved. Since we have a relatively large dataset, we can reject the null hypothesis of no relationship for all the models, and most confidently for the last model including only attitude as an explanatory variable.

The R-squared statistic describes how well the model fits the observed data: it is the proportion of the variance in the target variable explained by the model, and in a simple regression it measures the strength of the linear relationship between the predictor (attitude) and the target (exam points). In the present model, roughly 19% of the variance of the exam points variable can be explained by the attitude score.
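In a simple regression with one explanatory variable, the multiple R-squared is exactly the squared Pearson correlation between the two variables. A small self-contained check on simulated data (not the course data):

```r
# simulate a simple linear relationship
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

# R-squared from the model equals the squared correlation
all.equal(summary(fit)$r.squared, cor(x, y)^2)  # TRUE
```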

2.4 Diagnostic plots

The linear regression model includes some assumptions about the data, so next we will draw a few diagnostic plots to assess whether those assumptions hold for the present dataset.

# generating selected diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
plot(my_model2, which=c(1,2,5))

The linear model assumes a linear relationship between the explanatory and outcome variables, and this assumption can be checked from the residuals vs fitted plot. The fitted line is roughly horizontal and the residuals show no distinct pattern, so the assumption of a linear relationship seems to hold here.

Secondly, the residual errors are assumed to be normally distributed, which can be evaluated with the normal Q-Q plot. The residual points should follow the straight dashed line, which holds relatively well here, at least for the majority of the values. Some discrepancy can be seen at the lowest and highest values, which is quite normal, though.

Furthermore, the model assumes that the residuals have constant variance, and we also want to check that no single observation has an outsized influence on the regression results, i.e., that the results would not change drastically if some extreme case were excluded or included. Influential cases can be evaluated with the residuals vs leverage plot, in which the 3 most extreme points are highlighted. A few cases seem to exceed 3 standard deviations and represent potential outliers, so it might be useful to test the model fit without them.
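Leverage and influence can also be inspected numerically rather than visually. A sketch using the built-in mtcars data (the same functions would apply to the model fitted above):

```r
# fit a simple model on a built-in dataset for illustration
m <- lm(mpg ~ wt, data = mtcars)

# hat values measure leverage; Cook's distance combines
# leverage with the size of the residual
lev   <- hatvalues(m)
cooks <- cooks.distance(m)

# a common rule of thumb flags points with Cook's distance above 4/n
which(cooks > 4 / nrow(mtcars))
```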

Based on these observations, it can be concluded that the model assumptions apply relatively well to the present dataset.

2.5 Summary

In summary, we can confidently say that there is a relationship between the attitude and exam points variables, since we get a very significant p-value for the model. However, the p-value should not be interpreted to mean that exam points could be predicted with high confidence from the attitude score alone, since less than 20% of the variability of the exam points variable is explained by attitude.


3. Logistic regression

date()
## [1] "Tue Nov 21 15:53:45 2023"

Loading necessary R packages

library(boot)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

3.1 Alcohol consumption dataset

Reading in the joined student alcohol consumption dataset.

# reading in the dataset
alc <- read.csv("/Users/sannimoi/Documents/courses/IODS/IODS-project/data/alc.csv")
# checking the names of the variables, i.e., columns
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The current dataset was retrieved from http://www.archive.ics.uci.edu/dataset/320/student+performance and it contains information about the performance of Portuguese secondary education students in two subjects, math and Portuguese. For the present purpose, the two datasets have been combined, and that modified data is read in above. Several variables are easy to understand from the column name alone, like school, sex, age and address. The Dalc and Walc variables are especially important here, since they describe alcohol consumption on workdays and weekends, respectively, while alc_use is the mean of the two and high_use is set to TRUE if alc_use exceeds 2.
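The derived columns can be reproduced with a couple of lines of base R. A minimal sketch on a hypothetical two-student data frame (the column names match the real data; the values are made up):

```r
# hypothetical workday/weekend consumption values for two students
toy <- data.frame(Dalc = c(1, 4), Walc = c(2, 5))

# alc_use is the mean of workday and weekend consumption
toy$alc_use <- (toy$Dalc + toy$Walc) / 2
# high_use is TRUE when that mean exceeds 2
toy$high_use <- toy$alc_use > 2

toy$high_use  # FALSE TRUE
```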

3.2 Hypotheses

The purpose of this analysis is to study the relationships between high/low alcohol consumption and some other variables in the data. I have chosen the following variables to study:

  • failures - number of past class failures (numeric: n if 1<=n<3, else 4)

  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

  • absences - number of school absences (numeric: from 0 to 93)

I hypothesize that weekly study time and quality of family relationships negatively correlate with alcohol consumption, i.e., that a higher value for these factors would usually be associated with lower alcohol consumption. Regarding the number of school absences and past class failures, I assume the opposite, i.e., a positive correlation between higher alcohol consumption and a higher number of absences/failures.

3.3 Variable distributions and relationships

# drawing a bar plot of each chosen variable
gather(alc[,c("failures", "studytime", "famrel", "absences", "alc_use")]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# and then summarizing the number of students that have been classified to the high alcohol usage group
alc %>% group_by(high_use) %>% summarise(count = n())
## # A tibble: 2 × 2
##   high_use count
##   <lgl>    <int>
## 1 FALSE      259
## 2 TRUE       111

In general, based on the plots and table above, we can observe that most students have a reasonable number of absences, with only a few standing out with more than 15, and that the vast majority of students haven’t failed any classes in the past. The majority of students also fall into the low alcohol consumption category (high_use = FALSE, i.e., at most 2 portions based on the bar plot), yet still about 30% have been classified into the high alcohol use group. Most students report good or excellent quality family relationships. Weekly study time, in turn, averages around 2-5 hours, as category 2 stands out as the most common one; about 100 students study less than that, while fewer than 100 report studying more than 5 hours.

# creating cross-tabulations describing the numbers in each candidate explanatory variable category and the high/low alcohol usage category
table(high_use = alc$high_use, prediction = alc$failures)
##         prediction
## high_use   0   1   2   3
##    FALSE 238  12   8   1
##    TRUE   87  12   9   3
table(high_use = alc$high_use, prediction = alc$studytime)
##         prediction
## high_use   1   2   3   4
##    FALSE  56 128  52  23
##    TRUE   42  57   8   4
table(high_use = alc$high_use, prediction = alc$famrel)
##         prediction
## high_use   1   2   3   4   5
##    FALSE   6   9  39 128  77
##    TRUE    2   9  25  52  23
table(high_use = alc$high_use, prediction = alc$absences)
##         prediction
## high_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 26 27
##    FALSE 50 37 40 31 24 16 16  9 14  5  5  2  3  1  1  0  0  1  0  2  1  0  0
##    TRUE  13 13 16  7 11  6  5  3  6  6  2  3  4  1  6  1  1  1  1  0  1  1  1
##         prediction
## high_use 29 44 45
##    FALSE  0  0  1
##    TRUE   1  1  0
# plotting boxplots to study the relationship between the alcohol usage (high or not) and the chosen variables

# drawing boxplots comparing values between the high vs low alcohol usage groups
ggplot(alc, aes(x = high_use, y = failures, col=high_use)) + geom_boxplot() + ggtitle("Past class failures") + xlab("High alcohol use")

ggplot(alc, aes(x = high_use, y = studytime, col=high_use)) + geom_boxplot() + ggtitle("Study time") + xlab("High alcohol use")

ggplot(alc, aes(x = high_use, y = famrel, col=high_use)) + geom_boxplot() + ggtitle("Quality of family relationships") + xlab("High alcohol use")

ggplot(alc, aes(x = high_use, y = absences, col=high_use)) + geom_boxplot() + ggtitle("Absences from school") + xlab("High alcohol use")

From the tables and plots above, we can get some idea of whether the chosen variables have a relationship with high alcohol consumption. As I assumed, the students reporting high alcohol consumption study less on average and report family relationships of lower quality on average. Also, as hypothesized, the number of absences from school seems higher on average in the high alcohol use group. The relationship with failed classes, however, is difficult to evaluate based on the boxplot, since so few students had previously failed classes; still, some indication of a correlation can be seen in the table, as a relatively larger share of students with 2 or more failed classes also consume a high amount of alcohol.

3.4 Logistic regression

To statistically explore the relationship between the chosen explanatory variables and the high/low alcohol consumption target variable, a logistic regression model will be fitted.

# fitting the model, using failres, studytime, famrel and absences as the explanatory variables
model <- glm(high_use ~ failures + studytime + famrel + absences, data = alc, family = "binomial")
# printing the summary
summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + studytime + famrel + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0531  -0.7954  -0.6253   1.0826   2.1737  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.60734    0.61872   0.982 0.326292    
## failures     0.49010    0.20416   2.401 0.016370 *  
## studytime   -0.48901    0.15962  -3.064 0.002187 ** 
## famrel      -0.24724    0.12916  -1.914 0.055598 .  
## absences     0.07417    0.02246   3.303 0.000958 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 409.94  on 365  degrees of freedom
## AIC: 419.94
## 
## Number of Fisher Scoring iterations: 4
# then printing the odds ratios of the coefficients
exp(coef(model))
## (Intercept)    failures   studytime      famrel    absences 
##   1.8355483   1.6324760   0.6132323   0.7809515   1.0769908
# and then printing the confidence intervals for the coefficients 
confint(model)
## Waiting for profiling to be done...
##                   2.5 %       97.5 %
## (Intercept) -0.60724157  1.826969798
## failures     0.09542278  0.902367559
## studytime   -0.81145054 -0.184130907
## famrel      -0.50161742  0.006600938
## absences     0.03228991  0.120768107

Looking at the summary of the logistic regression model, the deviance residuals can be interpreted similarly to the residuals in linear regression: they describe the differences between the observed values and the values estimated by the model, summarized by the minimum and maximum, the 1st and 3rd quartiles, and the median, which is about -0.63 here. The coefficients table describes the estimated intercept and the slope terms for each explanatory variable, together with their p-values. The p-value for the intercept remains quite high at 0.33, while the p-values for all the slope terms except famrel are significant. The null deviance describes the model fit when only the intercept is taken into account, while the residual deviance is the fit when all variables are included, so the larger the difference, the better the model; the difference does not seem very large in this case. The AIC, in turn, is useful for comparing the fit of multiple regression models, as it helps to find the model that explains the most variation in the data; the lower the AIC, the better.
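The drop from the null deviance to the residual deviance can also be tested formally: the difference is approximately chi-squared distributed, with degrees of freedom equal to the number of added parameters. Plugging in the numbers from the summary above:

```r
# null deviance 452.04 on 369 df, residual deviance 409.94 on 365 df
dev_diff <- 452.04 - 409.94   # deviance explained by the four predictors
df_diff  <- 369 - 365

# p-value of the likelihood-ratio test
# (null hypothesis: the predictors add nothing over the intercept)
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
```

The p-value is very small, so the predictors jointly improve on the intercept-only model.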

The odds ratios describe the multiplicative change in the odds of the event per one-unit increase in an explanatory variable. An odds ratio less than one indicates a negative association, here study time and quality of family relationships: when they increase, the odds of high alcohol consumption decrease. The odds ratios for failures and absences, in turn, are greater than one, so when they increase, the odds of high alcohol consumption increase. Odds of 1 correspond to a 0.5 probability via the formula p = odds/(1 + odds); e.g., the odds ratio of failures corresponds to 1.632/(1 + 1.632) = 0.62, while the odds ratio of study time corresponds to 0.613/(1 + 0.613) = 0.38. Based on these odds ratios, all of the explanatory variables seem to associate with alcohol consumption as I assumed: studytime and famrel correlate negatively, while failures and absences correlate positively with alcohol consumption.
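The odds-to-probability conversion used above can be written as a small helper (the function name is hypothetical, for illustration only):

```r
# convert odds to a probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

# odds ratios printed above for failures and studytime
round(odds_to_prob(1.6324760), 2)  # 0.62, as computed in the text
round(odds_to_prob(0.6132323), 2)  # 0.38
```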

The confidence intervals, in turn, describe the range of plausible values for a parameter: here we can be 95% confident that the slope for each explanatory variable lies within the interval shown above. For example, the slope for failures is likely between 0.095 and 0.902, which is quite a wide range; this is also reflected in its p-value (0.016370), which is higher than that of, e.g., absences, even though the odds ratio for failures is larger. For absences, in turn, the confidence interval is between 0.032 and 0.121 and the p-value is highly significant at 0.000958. This highlights that when exploring the relationship between an explanatory and an outcome variable, one should check both the effect size (the odds ratio) and the measures of uncertainty around it, like the p-value and the confidence interval.

Since the famrel variable did not have a significant relationship with alcohol consumption, it is removed from the model next, and the predictive power of the reduced model is tested.

3.5 Exploring the predictive power of the logistic regression model

# since quality of family relationships did not seem to have that significant relationship with alcohol consumption, fitting the model without it
model2 <- glm(high_use ~ failures + studytime + absences, data = alc, family = "binomial")

# predicting the probability of high_use based on the model
probabilities <- predict(model2, type = "response")

# adding the predicted probabilities to alc
alc <- mutate(alc, probability = probabilities)

# using the probabilities to make a prediction of high_use
# classifying probability > 0.5 as a cutoff for high use
alc <- mutate(alc, prediction = probability > 0.5)

# generating a 2x2 cross tabulation
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   246   13
##    TRUE     86   25
# calculating for how many students the prediction is wrong
n_wrong <- 0
for (i in 1:nrow(alc)) {
  if (alc$high_use[i]!=alc$prediction[i]) {n_wrong <- (n_wrong+1)}}
n_wrong
## [1] 99
# since there are 370 students in total, the training error can be calculated n_wrong/370
n_wrong/370
## [1] 0.2675676
# then checking how well would random predictions work
# doing the following steps 100 times to derive an estimate of how well a random classification performs compared to our model
random_errors <- vector()
for (j in 1:100) {
  # sampling the high use column to derive a new randomized column with the same ratio of TRUE/FALSE  values
  alc["high_use_random"] <- sample(alc$high_use)
  # then calculating how many wrong predictions we get from this random classification
  n_wrong <- 0
  for (i in 1:nrow(alc)) {
    if (alc$high_use[i]!=alc$high_use_random[i]) {n_wrong <- (n_wrong+1)}}
  # since there are 370 students in total, the training error can be calculated n_wrong/370
  random_errors <- append(random_errors, n_wrong/370)
}
# then checking the mean, median and range of the error rate of these randomized classifications
mean(random_errors)
## [1] 0.4253514
median(random_errors)
## [1] 0.427027
range(random_errors)
## [1] 0.3729730 0.4810811
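The error-rate loops above work, but the same misclassification rate can be computed with a one-line vectorised expression. A self-contained sketch on hypothetical observed/predicted vectors:

```r
# hypothetical observed classes and model predictions
actual    <- c(TRUE, FALSE, FALSE, TRUE,  FALSE)
predicted <- c(TRUE, FALSE, TRUE,  FALSE, FALSE)

# the mean of the element-wise inequality is the misclassification rate
mean(actual != predicted)  # 0.4 (2 wrong out of 5)
```

With the real data, `mean(alc$high_use != alc$prediction)` would give the same training error as the loop.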

Based on the model, we can say fairly confidently that the variables in the final model, weekly study time, number of past class failures and school absences, have a relationship with the outcome variable, alcohol consumption. However, the predictive power of the model, i.e., predicting alcohol consumption from these variables, is by no means ideal, since the classification into the high/low consumption groups fails almost 27% of the time (= training error). This is still somewhat better than grouping the students at random (on average a 42% error rate over 100 randomizations), yet far from what we would call a confident predictive model.
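Since the training error is computed on the same data used to fit the model, a cross-validated error would give a fairer estimate of predictive power. A sketch using boot::cv.glm (the boot package is already loaded above) on simulated stand-in data; the same pattern would apply to model2 and alc:

```r
library(boot)

# simulated binary-outcome data standing in for the real dataset
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))

fit <- glm(y ~ x, data = d, family = "binomial")

# cost function: proportion of predictions on the wrong side of 0.5
cost <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validated misclassification rate
cv <- cv.glm(d, fit, cost = cost, K = 10)
cv$delta[1]
```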

4. Clustering and classification

date()
## [1] "Tue Nov 21 15:53:47 2023"

Loading necessary R packages

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)

4.1 Boston suburban housing values dataset

# loading in the Boston data from the MASS package
data("Boston")
# checking the dimensions and structure
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston dataset contains information about housing values in the suburbs of Boston. There are in total 506 observations (rows in the data frame) and 14 variables (columns in the data frame), most of them numeric. The variables include various pieces of information about the studied suburbs, like the per capita crime rate (crim), the average number of rooms per dwelling (rm), and the median value of owner-occupied homes in $1000s (medv). All the variables are explained here.

4.2 Graphical overview of the data

# for plotting purposes, recoding the chas variable to a TRUE/FALSE scale
# 1 -> "TRUE" (i.e., the suburb tract bounds the river)
# 0 -> "FALSE" (i.e., the suburb tract does not bound the river)
Boston$chas <- ifelse(Boston$chas == 1, "TRUE", "FALSE")

# printing out the data summary
summary(Boston)
##       crim                zn             indus           chas          
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Length:506        
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   Class :character  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Mode  :character  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14                     
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10                     
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74                     
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# printing a graphical overview of the data
ggpairs(Boston, lower = list(continuous = wrap("points", alpha = 0.3, size=0.1), combo = wrap("dot", alpha = 0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))

In the overview above, we can see that some variable distributions, like crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and dis (weighted mean of distances to five Boston employment centres), are highly right-skewed, with most values concentrated at the low end of the range. Some variables, like indus (proportion of non-retail business acres per town) and tax (full-value property-tax rate per $10,000), seem to have two distinct peaks, while others, like rm (average number of rooms per dwelling), appear more normally distributed. These observations seem reasonable. For example, it is expected that only a small minority of residential lots would be very large in a big city like Boston (the skewness of zn), while the proportion of non-retail business acres is likely quite town-specific: towns with many industrial buildings would have few residential buildings, and vice versa (the two peaks of indus). Likewise, the average number of rooms per dwelling can be expected to be roughly normally distributed, since extremes (a very low or high number of rooms) are unreasonable for the majority of residents, and values close to the average accumulate as the peak. Regarding the only non-numerical variable, chas, describing whether the suburb tract bounds the river, the vast majority of suburbs do not seem to be river-bound, which is also quite expected.

Regarding the relationships between the variables, many of the numerical variables correlate strongly with each other, the strongest association being the positive correlation between tax (full-value property-tax rate per $10,000) and rad (index of accessibility to radial highways): intuitively, the properties closest to the highways carry the highest tax rates. For the only non-numerical variable, chas, describing whether the suburb tract bounds the river, there are no direct correlation values, but some trends can be distinguished from the plots. For instance, fewer river-bound suburbs seem to have a high crime rate, while the age of the properties seems on average higher than in the non-river-bound counterparts. These observations seem intuitive, since the suburbs closest to the river most likely represent the (historical) downtown of the city with high-value homes.

Below, a visualization of the correlation matrix is also plotted to allow a closer review of the pairwise correlations between the variables.

# changing the chas variable back to the original 1/0 values
# "TRUE" -> 1 (i.e., the suburb tract bounds the river)
# "FALSE" -> 0 (i.e., the suburb tract does not bound the river)
Boston$chas <- ifelse(Boston$chas == "TRUE", 1, 0)

# calculating the correlation matrix
cor_matrix <- cor(Boston) 
# and visualizing it
corrplot(cor_matrix, method="circle")

In the correlation matrix visualization we can distinguish the strongest correlations between the variables, positive correlations visualized in blue and negative ones in red. For instance, nox (nitrogen oxides concentration) correlates negatively with dis (weighted mean of distances to five Boston employment centres) and positively with indus (proportion of non-retail business acres per town). This is quite expected, since pollution (such as nitrogen oxides) is likely to be highest in the industrial areas, which also likely host many of the employment centres.
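The coefficients shown in the matrix are Pearson correlations, which for any two columns equal the covariance divided by the product of the standard deviations. A minimal sketch with two made-up vectors (not from the Boston data) verifies this against R's built-in `cor()`:

```r
# two small made-up vectors for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Pearson correlation computed from the definition...
r_manual <- cov(x, y) / (sd(x) * sd(y))

# ...matches the built-in cor()
all.equal(r_manual, cor(x, y))
```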

4.3 Scaling the dataset

Next, to standardize the dataset, we will scale the values, i.e., subtract the column mean from each column and divide the difference by the column standard deviation.
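As a sanity check, the same standardization can be done by hand; this minimal sketch (on a made-up vector, not the Boston data) verifies that `scale()` matches the manual formula:

```r
# a small made-up numeric vector for illustration
x <- c(2, 4, 6, 8)

# manual standardization: subtract the mean, divide by the standard deviation
x_manual <- (x - mean(x)) / sd(x)

# scale() applied to the same vector gives the same values
x_scaled <- as.numeric(scale(x))

all.equal(x_manual, x_scaled)
```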

# centering and standardizing variables
boston_scaled <- as.data.frame(scale(Boston))

# summarizing the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

By scaling, we have essentially centered the values around 0, so the mean of each variable is now 0 (and its standard deviation 1). Next, we will create a categorical variable of the crime rate, generating 4 classes based on the quantile values: low, med_low, med_high and high.
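The quantile-based categorization works the same way on any numeric vector; this sketch with made-up values shows how `quantile()` supplies the breaks and `cut()` assigns each value to one of the four classes:

```r
# made-up values for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# quantile() returns the minimum, quartiles and maximum by default
breaks <- quantile(x)

# cut() bins each value into one of the four quartile-based classes;
# include.lowest = TRUE keeps the minimum value in the first class
classes <- cut(x, breaks = breaks, include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))
table(classes)
```

With quartile breaks, the classes are roughly equal in size by construction, which is also why the `table(crime)` counts above are nearly identical.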

# determining the quantiles of the crim (crime rate per capita) variables
quants <- quantile(boston_scaled$crim)
# creating a categorical variable called 'crime'
crime <- cut(boston_scaled$crim, breaks = quants, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# and checking the crime variable distribution
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# then removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# and adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

4.4 Train and test sets

Next, we split the data into train and test sets for the linear discriminant analysis, so that 80% of the observations go to the train set and 20% to the test set.

# dividing the data into train and test sets
set.seed(14)

# 80% of row indices allocated to the train set
train_ind <- sample(nrow(boston_scaled), round(0.8 * nrow(boston_scaled)))
# and the rest, 20%, to the test set
test_ind <- setdiff(1:nrow(boston_scaled), train_ind)

# creating data frames with only the train and test rows
train <- boston_scaled[train_ind, ]
test <- boston_scaled[test_ind, ]

4.5 Linear discriminant analysis

Next, we are fitting the linear discriminant model based on the train set, and using it to predict the crime category for the test set.

# conducting the linear discriminant analysis for the train set
# using crime as the outcome variable and all other variables as the predictor variables
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2395062 0.2567901 0.2567901 0.2469136 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.88018684 -0.9415937 -0.10997442 -0.8527319  0.42189046 -0.8760001
## med_low  -0.07228021 -0.2786005 -0.04518867 -0.5769003 -0.16683471 -0.4154389
## med_high -0.37922279  0.1835589  0.18195173  0.4087296  0.07424198  0.4046517
## high     -0.48724019  1.0149946 -0.03610305  1.0491662 -0.47816139  0.7977622
##                 dis        rad        tax     ptratio       black       lstat
## low       0.7912327 -0.6917945 -0.7302380 -0.43374706  0.38169387 -0.75045462
## med_low   0.4108086 -0.5434660 -0.4506348 -0.05185552  0.31564889 -0.13900995
## med_high -0.4055865 -0.4142635 -0.3135958 -0.30679176  0.07832721  0.02678365
## high     -0.8417971  1.6596029  1.5294129  0.80577843 -0.74492067  0.89683260
##                 medv
## low       0.49143415
## med_low  -0.03712692
## med_high  0.14269557
## high     -0.67834465
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.14094159  0.583571762 -1.07527346
## indus    0.04862218 -0.388509504  0.47793431
## chas    -0.08648985 -0.008871757  0.10195962
## nox      0.26003892 -0.865688531 -1.15397700
## rm      -0.09711108 -0.044237317 -0.12791697
## age      0.23864839 -0.360282856 -0.07065071
## dis     -0.14600415 -0.308393676  0.57267244
## rad      3.41164504  0.713321674 -0.18377522
## tax     -0.03355981  0.309963183  0.56766834
## ptratio  0.13083968  0.053636214 -0.26691521
## black   -0.14812971  0.041315399  0.11953790
## lstat    0.14592880 -0.136928628  0.40634740
## medv     0.12599482 -0.366917135 -0.10819431
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9515 0.0350 0.0135
# defining the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col=color, pos=3)}

# defining target classes as numeric, 1 - low, 4 - high
classes <- as.numeric(train$crime)

# plotting the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

# first determining the correct crime categories of the test set
correct_cat <- test$crime
# and removing the variable from the table
test <- dplyr::select(test, -crime)

# predicting classes of the test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_cat, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19      10        1    0
##   med_low    3      15        4    0
##   med_high   0       8       13    1
##   high       0       0        1   26

After fitting the model on the train data and applying it to the test data, we can see that 19 + 15 + 13 + 26 = 73 observations have been correctly classified, giving a success rate of 73/101 ≈ 0.72. Most of the wrong predictions are also only one category away from the correct one, i.e., one category lower or higher. Only one prediction is two categories off: an observation whose correct category is low but whose prediction is med_high. The model therefore seems to work relatively well at predicting the correct crime category, and when the prediction fails, it is rarely far from the accurate one. However, with a relatively small dataset like this, the choice of observations for the train and test sets, and whether the category proportions in both sets represent the actual data well, can have a high impact on the model.
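The success rate above can be computed directly from the cross-tabulation: correct predictions lie on the diagonal, so accuracy is the diagonal sum divided by the total count. A minimal sketch with a made-up 2x2 confusion matrix (the counts here are hypothetical, not from the LDA above):

```r
# hypothetical confusion matrix (made-up counts)
tab <- matrix(c(19, 3, 10, 15), nrow = 2,
              dimnames = list(correct = c("A", "B"), predicted = c("A", "B")))

# accuracy = correctly classified (diagonal) / all observations
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # (19 + 15) / 47
```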

4.6 K-means clustering

Next, the Boston housing dataset will be reloaded and standardized, and distances between the observations calculated.

# loading in the Boston data from the MASS package
data("Boston")

# centering and standardizing variables
boston_scaled <- as.data.frame(scale(Boston))

# creating a distance matrix using the Euclidean distance
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# as well as creating a distance matrix using the Manhattan distance
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
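For two observations, the distances computed by `dist()` reduce to the familiar formulas: the Euclidean distance is the square root of the summed squared differences, and the Manhattan distance is the sum of absolute differences. A sketch with two made-up points verifies both:

```r
# two made-up 2-dimensional points
a <- c(0, 0)
b <- c(3, 4)
m <- rbind(a, b)

# Euclidean: sqrt(3^2 + 4^2) = 5
d_eu <- as.numeric(dist(m))
# Manhattan: |3| + |4| = 7
d_man <- as.numeric(dist(m, method = "manhattan"))

c(euclidean = d_eu, manhattan = d_man)
```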

Next, to further study the similarity of the observations, k-means clustering will be conducted. The unsupervised method will assign the observations into clusters based on their similarity.

set.seed(14)
# conducting the k-means clustering, starting with 4 clusters
km <- kmeans(boston_scaled, centers = 4)

# visualizing the dataset with clusters indicated by colours
ggpairs(boston_scaled[c("crim", "zn", "indus", "nox", "age", "dis", "rad", "tax", "lstat", "medv")], mapping = aes(col=factor(km$cluster), alpha=0.3), lower = list(combo = wrap("dot", alpha=0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))

It seems that 4 clusters may be too many, since the cluster-wise distributions of the variables overlap quite notably in many of the plots. Therefore, divisions into 2 and 3 clusters are tested next.
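Besides visual inspection, the number of clusters can also be assessed numerically with the total within-cluster sum of squares (WCSS): when adding another cluster no longer drops the WCSS sharply, the extra cluster adds little. A sketch of this "elbow" heuristic on made-up two-group data (not the Boston data):

```r
set.seed(14)
# made-up data with two clearly separated groups
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# total within-cluster sum of squares for k = 1..6
twcss <- sapply(1:6, function(k) kmeans(x, centers = k)$tot.withinss)

# the WCSS should drop sharply at k = 2 and level off afterwards
plot(1:6, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")
```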

set.seed(14)
# conducting the k-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)

# visualizing the dataset with clusters indicated by colours
ggpairs(boston_scaled[c("crim", "zn", "indus", "nox", "age", "dis", "rad", "tax", "lstat", "medv")], mapping = aes(col=factor(km$cluster), alpha=0.3), lower = list(combo = wrap("dot", alpha=0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))

set.seed(14)
# conducting the k-means clustering with 3 clusters
km <- kmeans(boston_scaled, centers = 3)

# visualizing the dataset with clusters indicated by colours
ggpairs(boston_scaled[c("crim", "zn", "indus", "nox", "age", "dis", "rad", "tax", "lstat", "medv")], mapping = aes(col=factor(km$cluster), alpha=0.3), lower = list(combo = wrap("dot", alpha=0.3, size=0.1)), upper=list(continuous=wrap("cor", size=2.1))) + theme(axis.text=element_text(size=5), axis.title = element_text(size=5))

Division into 2 clusters seemed to work well for some variables, like indus and tax, where there are 2 quite distinct peaks. However, for some variables, like lstat and medv, the per-cluster distributions are quite wide, which may indicate that 2 clusters do not optimally capture the complexity of the data. When the data is divided into 3 clusters, in turn, the clusters are quite nicely distinguishable from each other for many variables, like nox, and e.g., the distributions of medv and lstat are no longer as wide. Therefore, 3 clusters seems to be the best choice of the tested cluster numbers.

Looking at the last plot with 3 clusters more closely, we can see that the blue cluster is quite distinct in many respects: it has clearly the highest indus (proportion of non-retail business acres per town), nox (nitrogen oxides concentration), rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) values on average, while its dis (weighted mean of distances to five Boston employment centres) and medv (median value of owner-occupied homes in $1000s) values are on average the lowest. All of these observations suggest that these areas cluster together because they are more industrial: such suburbs are more likely to have many industrial buildings and higher pollution, and they seem to lie closest to the employment centres, since industrial areas likely hold many jobs and highway access is important to the industries too. The high tax rate might also reflect heavy industrial use of the buildings, as larger institutions likely pay more tax for their bigger properties than the average home owner. Higher crime rates also accumulate in this cluster, likely indicating that the residential parts of these suburbs are more often inhabited by people with a lower education level on average (higher lstat) and therefore likely lower income.

Regarding the other two clusters, they seem to represent less industrial areas. As the green cluster resembles the blue one a bit more than the red does, I would interpret that the green cluster likely represents areas closer to downtown Boston, while the red cluster represents more distant suburbs. These clusters do not seem to differ much in terms of crime rate or tax. However, the red cluster seems to have the highest proportion of large residential properties (zn, proportion of residential land zoned for lots over 25,000 sq.ft.), the lowest proportion of non-retail business acres per town (indus) on average, the lowest nox (nitrogen oxides concentration) and the highest dis (weighted mean of distances to five Boston employment centres), all suggesting that these areas lie further from the heart of the city. Also, the percentage of lower-status population (lstat) and the age of the homes seem the lowest in these areas, while the median value of homes seems the highest on average, indicating that these areas perhaps house the upper class, since the properties are on average newer and bigger and owned by more highly educated people. The green cluster, in turn, seems to be the 'middle one', so it likely represents the non-industrial suburbs closer to downtown Boston, with e.g. pollution higher than in the red cluster yet lower than in the blue one, but with a smaller average distance to the employment centres (dis) than the red cluster.

All in all, these 3 clusters seem to represent 3 distinct classes of Boston suburbs well, and comparing the distributions also allows one to hypothesize about the types of areas these clusters likely represent on average.

(more chapters to be added similarly as we proceed with the course!)